1.  Introduction

[UC] introduced the idea of an ultracomputer and reviewed algorithms for
two permutation problems: the "static permutation problem", in which an
algorithm is tailored to each specific permutation π (given in advance), and
the "dynamic permutation problem", in which one algorithm must effect all
permutations (i.e., the permutation is part of the data).

We wish to permute N items w_0, ..., w_{N-1} in an ultracomputer containing P
processing elements (PEs), PE_0, ..., PE_{P-1}. Under the assumption that N = P
and that w_i ∈ PE_i, [UC] gives the following worst case analyses: the static
permutation algorithm requires 4 log P - 3 data communication steps, and the
dynamic permutation algorithm requires Θ((log P)^2) data communication steps.
It is easily seen that for both algorithms the average case behavior closely
approximates the worst case.

Here we present a data motion algorithm oriented toward average case
rather than worst case performance, and supply an argument suggesting that the
average number of data communication steps required is approximately 3 log P.

Note that [UCN11] considers ultracomputers containing L items in each PE
(thus N = LP) and presents algorithms plus worst case analyses for both
permutation problems defined above. The static permutation algorithm
essentially scales up linearly, requiring 4L log P - L data communication
steps, whereas the dynamic permutation algorithm improves in efficiency,
requiring Θ(L log P) data communication steps (provided that L is large
enough). Again the average case behavior closely approximates the worst case.


This work was supported by DOE grant DE-AC02-76ER03077 and by NSF grant
NSF-MCS79-21258.
2.   The  Algorithm

Let us assume that N = P, w_i ∈ PE_i, and the permutation i → π(i) is
available. Note that π may either be specified in advance or be part of the
data. Our algorithm is derived from the packing algorithm presented in [UC]:
the idea is for each w_i to make progress towards PE_{π(i)} one bit at a time.

Each PE_i maintains a queue Q_i of "data, destination, count" triples. Let x
be the triple currently at the front of Q_i, let d be the second (destination)
component of x, and let k be the third (count) component. PE_i is assumed to be
idle during any step in which Q_i is empty. If i_D ... i_1 is the binary
representation of i (with D = log P), we define
unshuffle(i) = i_1 i_D i_{D-1} ... i_2 and partner(i) = i_D ... i_2 \bar{i}_1,
where \bar{i}_1 = 1 - i_1.
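The two index maps can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes that unshuffle is a one-bit right rotation of the D-bit index and that partner complements the low-order bit, and the explicit parameter `D` (the number of index bits) is introduced here for convenience.

```python
def unshuffle(i: int, D: int) -> int:
    """Rotate the D-bit index right by one: i_D ... i_1 -> i_1 i_D ... i_2."""
    return (i >> 1) | ((i & 1) << (D - 1))


def partner(i: int) -> int:
    """Complement the low-order bit i_1 (assumed reading of partner)."""
    return i ^ 1


D = 3  # P = 8 processing elements
print([unshuffle(i, D) for i in range(1 << D)])  # [0, 4, 1, 5, 2, 6, 3, 7]
print([partner(i) for i in range(1 << D)])       # [1, 0, 3, 2, 5, 4, 7, 6]
```

Applying unshuffle D times returns every index to itself, which is why D rounds of "fix the low bit, then unshuffle" suffice to assemble all D destination bits.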

The algorithm proceeds as follows: Each PE_i initializes Q_i to (w_i, π(i), 1)
and then iterates through the following 3 steps until, for all i,
w_i ∈ PE_{π(i)}:

(1) If d_k, the k-th bit of d, and i have different parities, then send x to
    the rear of Q_{partner(i)}.

(2) If d = i, remove x (since the w contained therein is w_{π^{-1}(i)}).

(3) Set k = k + 1 and send x to the rear of Q_{unshuffle(i)}.
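The three steps can be exercised with a small sequential simulation. The sketch below is one possible reading rather than a definitive implementation, and it rests on several assumptions that the text does not spell out: iterations are synchronous; in each iteration the front triple of every non-empty queue performs steps (1) to (3) as a unit; the bits of d are indexed cyclically, so that a triple that has already reached its destination is recognized by step (2) on its next turn; and partner complements the low-order bit. The function name `route` and the modelling of the datum w_i by the integer i are hypothetical.

```python
import random
from collections import deque


def unshuffle(i, D):
    # One-bit right rotation of a D-bit index.
    return (i >> 1) | ((i & 1) << (D - 1))


def partner(i):
    # Complement the low-order bit (assumed definition).
    return i ^ 1


def route(perm):
    """Simulate the routing of w_i to PE_{perm[i]}.

    Returns (delivered, iterations), where delivered maps each
    destination PE to the datum it received.
    """
    P = len(perm)
    D = P.bit_length() - 1
    assert D >= 1 and P == 1 << D, "P must be a power of two, at least 2"
    # Q_i initially holds the single triple [w_i, pi(i), 1]; w_i is modelled as i.
    queues = [deque([[i, perm[i], 1]]) for i in range(P)]
    delivered = {}
    iterations = 0
    while len(delivered) < P:
        iterations += 1
        # Snapshot the front triple of every non-empty queue.
        fronts = [(i, q.popleft()) for i, q in enumerate(queues) if q]
        for i, x in fronts:
            w, d, k = x
            here = i
            # Step (1): exchange if bit k of d and the low bit of the PE differ.
            if (d >> ((k - 1) % D)) & 1 != here & 1:
                here = partner(here)
            # Step (2): remove the triple once it sits at its destination.
            if d == here:
                delivered[here] = w
                continue
            # Step (3): advance the bit counter and unshuffle.
            x[2] = k + 1
            queues[unshuffle(here, D)].append(x)
    return delivered, iterations


random.seed(0)
P = 64
perm = list(range(P))
random.shuffle(perm)
delivered, iterations = route(perm)
assert all(delivered[perm[i]] == i for i in range(P))
print("iterations:", iterations)
```

Under these assumptions a triple visits, after j rounds, the PE whose index is d_j ... d_1 followed by the top D - j bits of its source, so it reaches PE_d after at most D rounds regardless of queueing delays.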

3.   A  Crude  Analysis  of  the  Average  Case 

We  make  the  fundamental  assumptions  that  the  permutation  has  been 
selected  at  random  and  that  at  each  step  the  triples  are  randomly  distributed 
among  the  processors.  The  latter  assumption  should  be  reasonable  except  near 
the  beginning  and  end  of  the  algorithm's  execution. 

Using the fundamental assumptions, it is easy to see that the average number
of non-empty queues is

    P (1 - (1 - 1/P)^P) ≈ P (1 - 1/e).

Since each of the P triples must execute steps (1) and (3) at most log P times,
and at each iteration of steps (1) and (3) about P(1 - 1/e) triples are
processed, it follows that (log P) / (1 - 1/e) ≈ 1.58 log P iterations are
required. Each iteration comprises two data communication steps, (1) and (3),
so the number of data communication steps is about 2 × 1.58 log P = 3.16 log P,
i.e., approximately 3 log P.
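The arithmetic in this estimate is easy to check numerically. The following sketch evaluates the occupancy expression for a few values of P and compares it with the limiting value 1 - 1/e; the function name `expected_nonempty` is illustrative, and two communication steps per iteration are assumed, as in the argument above.

```python
import math


def expected_nonempty(P):
    """Expected number of non-empty queues when P triples land at random in P queues."""
    return P * (1 - (1 - 1 / P) ** P)


limit = 1 - 1 / math.e  # limiting fraction of non-empty queues

for P in (16, 256, 4096):
    frac = expected_nonempty(P) / P
    iterations = 1 / frac      # iterations, in units of log P
    comm_steps = 2 * iterations  # two communication steps per iteration
    print(P, round(frac, 4), round(iterations, 3), round(comm_steps, 3))

print(round(1 / limit, 3), round(2 / limit, 3))
```

The finite-P fraction is slightly above 1 - 1/e, so the limiting constants are mildly pessimistic for small P.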

4.   Extensions  and  Remarks 

We can use an analogous algorithm when there are L items contained in each
PE. Then the expected number of non-empty queues is

    P (1 - (1 - 1/P)^N) ≈ P (1 - (1/e)^L),

and the expected number of data communication steps ≈ 2L log P / (1 - (1/e)^L).
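The occupancy estimate for L items per PE can be checked by a small Monte Carlo experiment. This sketch simply throws N = LP triples into P queues uniformly at random, which is the random-distribution assumption of Section 3 and not a simulation of the routing algorithm itself; the function name `nonempty_fraction` and its parameters are illustrative.

```python
import math
import random


def nonempty_fraction(P, L, trials=200, seed=0):
    """Average fraction of non-empty queues when L*P triples are placed at random."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        occupied = {rng.randrange(P) for _ in range(L * P)}
        total += len(occupied)
    return total / (trials * P)


for L in (1, 2, 3):
    estimate = nonempty_fraction(256, L)
    predicted = 1 - math.exp(-L)  # 1 - (1/e)^L
    print(L, round(estimate, 3), round(predicted, 3))
```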

On average only half the PEs that execute step (1) actually send a triple to
their partners. Thus, if we allow asynchronous operation and ignore the issue of
data contention, the expected number of data communication steps
≈ 1.5L log P / (1 - (1/e)^L).

The following table indicates that even for modest L both the original and
asynchronous algorithms approach their asymptotic limits (2 and 1.5,
respectively) rapidly. Note that, since one can easily find a permutation that
routes √P items through PE_0, the algorithms that we have proposed exhibit
markedly different worst case and average case behavior.


    L     Original     Asynchronous

    1       3.16           2.37
    2       2.31           1.73
    3       2.10           1.58
    4       2.04           1.53
    5       2.01           1.51

Expected number of data communication steps divided by L log P.
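The table entries follow directly from the two estimates in this section. As a check, the sketch below evaluates c / (1 - (1/e)^L) with c = 2 for the original algorithm and c = 1.5 for the asynchronous variant; the function name `steps_per_L_log_P` is illustrative.

```python
import math


def steps_per_L_log_P(L, per_iteration):
    """Expected communication steps divided by L log P.

    per_iteration is 2 for the original algorithm and 1.5 for the
    asynchronous variant.
    """
    return per_iteration / (1 - math.exp(-L))


rows = [
    (L, f"{steps_per_L_log_P(L, 2.0):.2f}", f"{steps_per_L_log_P(L, 1.5):.2f}")
    for L in range(1, 6)
]
for row in rows:
    print(*row)
```

Both columns converge geometrically in L, since the correction factor differs from 1 by roughly (1/e)^L.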